This is the final project submission for IS608, a required course in the Master's Degree in Data Analytics program at SPS CUNY. A list of projects from my fellow classmates can be found here. All the code and data is hosted on Github. Checkout the final visualization, if you want to skip over the explanation below.
The goal of this project is to create choropleth map of energy consumption on a per-county level. Data from the EIA gives data on utility level demand and what counties utilities deliver power to. We use data, plus Census data on population, to come up with an estimate of per-county energy consumption. This methodology will be explained below.
Here is some pseudo-code decribing the whole process, from raw data to final map output
Since the data from the EIA is organized such that utilties only have to report their total numbers, we do not have direct "county to consumption" numbers. This needs to be estimated from what we have, and here is my attempt at doing so.
Given a certain county, there will be an associated number of utilities that provide energy to it. Conversely, we also have for a given utility, a list of counties it provides energy for. From the census data, we can then get an idea of how many people a utility servers by summing the population across counties it serves. This is an upper estimation, since the utility more than likely does not serve 100% of a county (due to competition, etc).
What we have from data is the total amount of consumption per utility, and now we also have a total population that the utility potentially serves. Given this, as an estimation, we can say that a given county consumes energy based on a population ratio of the county's population to the total amount of people served by the utility.
The final estimation of a county's usage is to sum these contributions across all utilities a given county is served by.
In [6]:
### Globals and Constants
import csv
from collections import defaultdict
# Column indicies for EIA data
UTILITY = 1
STATE = 3
COUNTY = 4
# Column indicies for EIA retail data
UTILITY_ID = 1
CONSUMPTION = 21
SHORT_CONSUMPTION = 5
# Column indicies for Census data
ID = 1
DESC = 2
POP2012 = 7
# Constants
YEAR = 2012
PATH_TO_SERVICE_DATA = "data/f8612012/service_territory_%s.csv" % YEAR # Path to the defined utility service territories
PATH_TO_RETAIL_DATA = "data/f8612012/retail_sales_%s.csv" % YEAR # Path to the defined utility service territories
PATH_TO_SHORTRETAIL_DATA = "data/f8612012/short_form_%s_Changed.csv" % YEAR # Path to the defined utility service territories
PATH_TO_POPULATION_DATA = "data/PEP_2013_PEPANNRES/PEP_2013_PEPANNRES_with_ann_with_changes.csv"
STATES = { 'AK': 'Alaska','AL': 'Alabama','AR': 'Arkansas','AS': 'American Samoa','AZ': 'Arizona','CA': 'California','CO': 'Colorado','CT': 'Connecticut','DC': 'District of Columbia','DE': 'Delaware','FL': 'Florida','GA': 'Georgia','GU': 'Guam','HI': 'Hawaii','IA': 'Iowa','ID': 'Idaho','IL': 'Illinois','IN': 'Indiana','KS': 'Kansas','KY': 'Kentucky','LA': 'Louisiana','MA': 'Massachusetts','MD': 'Maryland','ME': 'Maine','MI': 'Michigan','MN': 'Minnesota','MO': 'Missouri','MP': 'Northern Mariana Islands','MS': 'Mississippi','MT': 'Montana','NA': 'National','NC': 'North Carolina','ND': 'North Dakota','NE': 'Nebraska','NH': 'New Hampshire','NJ': 'New Jersey','NM': 'New Mexico','NV': 'Nevada','NY': 'New York','OH': 'Ohio','OK': 'Oklahoma','OR': 'Oregon','PA': 'Pennsylvania','PR': 'Puerto Rico','RI': 'Rhode Island','SC': 'South Carolina','SD': 'South Dakota','TN': 'Tennessee','TX': 'Texas','UT': 'Utah','VA': 'Virginia','VI': 'Virgin Islands','VT': 'Vermont','WA': 'Washington','WI': 'Wisconsin','WV': 'West Virginia','WY': 'Wyoming'}
# Global variables
countyToUtility = {} # Mapping from county number to a list of utilities serving it
utilityToCounty = {} # Mapping from utility id to a list of counties it serves
countyToPopulation = {} # Mapping from county number to a population from census data
nameToID = {} # Mapping from county name to the county code
utilityToConsumption = dict() # Mapping the utility id to the total consumption in mWh
utilityToPopulation = dict() # Mapping the utility id to the total county population it serves
countyToConsumption = dict() # Mapping the final result of county ID to consumption in mWh
This function reads in the census data, and loads a dictionary that maps the county name to its county ID (nameToID), and another dictionary that maps from the county ID to the population estimate in 2012 (countyToPopulation). The county name is constructed as "state_countyname", all lowercase
In [7]:
def loadCensusData():
''' Loads the mapping from county number to population into
'countyToUPopulation' and a name to ID mapping from the PATH_TO_POPULATION_DATA file. '''
f = open(PATH_TO_POPULATION_DATA)
reader = csv.reader(f)
reader.next()
reader.next()
for row in reader:
# Grab the important parts of the data
id = int(row[ID])
desc = row[DESC]
pop2012 = row[POP2012]
# Derive the 'county key' from the description column
(county, state) = desc.split(',')
county = county.lower().replace("county", "").replace(".", "").replace(" ", "")
state = state.lower().replace(' ', '')
key = state + '_' + county
# correction to Lousiana county names
if state == "louisiana":
key = key.replace("parish", "")
# Setup the two mappings
nameToID[key] = id
countyToPopulation[id] = float(pop2012)
def loadCountytoUtilityData():
''' Loads the mapping from county number to utilities into
'countyToUtility' from the PATH_TO_SERVICE_DATA file. '''
f = open(PATH_TO_SERVICE_DATA)
reader = csv.reader(f)
reader.next()
for row in reader:
state = STATES[ row[STATE].upper() ].lower().replace(' ', '')
county = row[COUNTY].lower().replace(' ', '').replace('.', '')
key = state + '_' + county
utilityID = int(row[UTILITY])
try:
if nameToID[key] in countyToUtility:
countyToUtility[nameToID[key]].add(utilityID)
else:
#if key not in nameToID:
# print "key %s not found" % key
countyToUtility[nameToID[key]] = set([utilityID])
if utilityID in utilityToCounty:
utilityToCounty[utilityID].add(nameToID[key])
else:
utilityToCounty[utilityID] = set([nameToID[key]])
except Exception as e:
pass
#print "Exception: %s" % key
def loadUtilityConsumptionData():
''' Load the retail data as mapping from utility ID
to total level of consumption in mWh. '''
# Open file as CSV and skip three header lines
f = open(PATH_TO_RETAIL_DATA)
reader = csv.reader(f)
reader.next()
reader.next()
reader.next()
# For each row in the reader ...
for row in reader:
# Read in the ID and consumption total level
id = int(row[UTILITY_ID])
consumption = float(row[CONSUMPTION].replace(',', ''))
# Update global mapping
utilityToConsumption[id] = consumption
f = open(PATH_TO_SHORTRETAIL_DATA)
reader = csv.reader(f)
reader.next()
# For each row in the reader ...
for row in reader:
# Read in the ID and consumption total level
id = int(row[UTILITY_ID])
consumption = float(row[SHORT_CONSUMPTION].replace(',', ''))
# Update global mapping
if id in utilityToConsumption:
print "Whoa, we have a duplicate utility id: %d" % id
utilityToConsumption[id] = consumption
def deriveUtilityPopulation():
''' Derive the total number of people in all counties served by a utility. '''
# Loop through all utility companies and the list of counties served by them
for utility, counties in utilityToCounty.items():
# Calculate total population for all counties
totPopulation = 0
for county in counties:
totPopulation += countyToPopulation[county]
# Save total population in mapping
utilityToPopulation[utility] = float(totPopulation)
def calculateCountyConsumption():
''' Apply our methodology and generate a mapping of county to total energy consumption in mWh. '''
countiesWithErrors = 0
for county, utilities in countyToUtility.items():
# Since we are using defaultdicts, if we try to access data on a utility we do not have,
# it'll be counted as 0. See 'Issues' section above.
#countyToConsumption[county] = sum([((countyToPopulation[county] / utilityToPopulation[utility]) *
# utilityToConsumption[utility]) for utility in utilities])
countySum = 0
errorCount = 0
for utility in utilities:
try:
countySum += ((countyToPopulation[county] / utilityToPopulation[utility]) * utilityToConsumption[utility])
except Exception as e:
errorCount+=1
#print "Error for county %d and utility %d" % (county, utility)
if errorCount != 0:
countiesWithErrors+=1
#print "Number of errors: %d / %d" % (errorCount, len(utilities))
countyToConsumption[county] = countySum
#print "Counties with errors: %d / %d" % (countiesWithErrors, len(countyToUtility.keys()))
# Start by loading the census data and the service territory mapping
loadCensusData()
loadCountytoUtilityData()
loadUtilityConsumptionData()
deriveUtilityPopulation()
calculateCountyConsumption()
In [8]:
# Print random sample of county name to list of utility IDs
for county, utilityList in countyToUtility.items()[0:10]:
print "%s is served by utility IDs %s" % (county, list(utilityList))
In [9]:
# Print random sample of county name to list of utility IDs
for name, countyID in nameToID.items()[0:10]:
print "%s (%s) had a population of %s" % (name, countyID, countyToPopulation[countyID])
In [10]:
# Print random sample of utility ID to consumption data
for utilityID, consumption in utilityToConsumption.items()[0:10]:
if consumption != 0:
print "Utility ID %s in 2012 had %s mWhs consumed" % (utilityID, consumption)
Please see:
In [6]:
from IPython.display import HTML
HTML('<iframe src=http://jquacinella.github.io/IS608_Project/map.html width=100% height=750></iframe>')
#HTML('<iframe src="file:///home/james/Code/Masters/IS608/Project/map.html" width=100% height=1000></iframe>')
Out[6]:
Lets also check out this map when we normalize by population. This will give us a view into avergae individual consumption by county:
In [7]:
from IPython.display import HTML
HTML('<iframe src=http://jquacinella.github.io/IS608_Project/map_norm.html width=100% height=750></iframe>')
#HTML('<iframe src="file:///home/james/Code/Masters/IS608/Project/map.html" width=100% height=1000></iframe>')
Out[7]:
The major difficulty in generating the above maps was determing a good color scale and the appropriate range to use for the legend. The consumption, as can be seen in the google charts example, the underlying distribution is skewed and has a long tail. Making a linear range will inevitably leave things out. Possibly using a log-scale would help, but I am not sure if that works well with coloring. Right now, many of the extreme outliers are in the same grouping, which allows us to see more detail in the map.
The above maps show an interesting pattern: when comparing the maps, we see a distinct difference: big cities, especially on the coasts have bigger total consumption (which is expected with higher populations); the middle of the country have lower total consumption but have higher per-capita consumption. That means that cities seem to be more effcient with regards to how much any given person consumes electricity. This could be due to many things:
Further investigation would be needed to figure out why much of the middle of the country has such higher individual usage.
In [ ]: